Time Series Characteristics

Features

A good overview: https://cran.r-project.org/web/packages/tsfeatures/vignettes/tsfeatures.html

Datasets

List of public time series datasets: https://github.com/awesomedata/awesome-public-datasets#timeseries

Don't forget to cite the UCR archive:

@misc{UCRArchive2018,
    title = {The UCR Time Series Classification Archive},
    author = {Dau, Hoang Anh and Keogh, Eamonn and Kamgar, Kaveh and Yeh, Chin-Chia Michael and Zhu, Yan 
              and Gharghabi, Shaghayegh and Ratanamahatana, Chotirat Ann and Yanping and Hu, Bing 
              and Begum, Nurjahan and Bagnall, Anthony and Mueen, Abdullah and Batista, Gustavo},
    year = {2018},
    month = {October},
    note = {\url{https://www.cs.ucr.edu/~eamonn/time_series_data_2018/}}
}

Legend 🟩🟨🟥

  • 🟩 Feature is simple and easy to compute
  • 🟨 Feature is only relevant for unscaled, raw data (otherwise, it is constant and meaningless)
  • 🟥 Feature is rather complex to compute (either in terms of the formula or in terms of computational resources)
In [1]:
%reload_ext autoreload
%autoreload 2

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import preprocessing
from time_series_characteristics import TimeSeriesCharacteristics, NormalizedTimeSeriesCharacteristics

import util

Input

Specify the following two variables:

  • UCR_ARCHIVE: The path to the UCR archive main directory (contains the dataset folders).
  • UCR_DATASET: The name of the UCR dataset to load. By default, the TEST dataset will be loaded.
In [2]:
UCR_ARCHIVE = INSERT_PATH_TO_UCR_MAIN_DIR_HERE
UCR_DATASET = "Rock"
In [3]:
def read_ucr_file(path):
    """Read the UCR file from the specified path into a DataFrame with a multi-index ('entity', 'time') and a single column 'value'"""
    df = pd.read_csv(path, sep="\t", header=None, dtype=np.float64)
    entity_index = [f"{os.path.split(path)[1]}_{i}" for i in df.index]
    df.index = pd.MultiIndex.from_tuples([(i,) for i in entity_index], names=["entity"])
    x, y = df.iloc[:, 1:].copy(), df.iloc[:, 0].astype(np.int32, copy=False)
    x[:] = preprocessing.scale(x, axis=1)
    
    # add entity "index" which is just a counter for ucr
    x["entity"] = entity_index

    # un-pivot the table and use the column-wise timestamps as an index instead
    x = pd.melt(x, id_vars=["entity"], value_name="value", var_name="time")

    # transform int-based time values to datetime (-1 because pandas labelled the first data column with 1 instead of 0)
    x["time"] = pd.to_datetime(x["time"] - 1, unit="m")

    # set new multi-index
    x.set_index(["entity", "time"], inplace=True, verify_integrity=True)
    x.sort_index(inplace=True)
    
    return x, y

def calculate_min_max_df(standardized_df, features_to_normalize, quantiles, index_levels, column_value, verbose):
    """Given a standardized DataFrame, calculate the quantiles for the specified features"""
    features_to_normalize_df = util.get_feature_combination_df(
        features_to_normalize,
        standardized_df,
        index_levels=index_levels,
        column_value=column_value,
        verbose=verbose)
    min_max_df = features_to_normalize_df.quantile(quantiles).T
    min_max_df.columns = [str(c) for c in min_max_df.columns]
    return min_max_df

def random_series(dtindex=False, plot=True, ax=None, random_state=None):
    """Get a random time series from the archive"""
    rnd = np.random.RandomState(random_state)
    random_entity = rnd.choice(DATA.index.get_level_values("entity"))
    sample = DATA.loc[random_entity]["value"].values
    if plot:
        noaxis = ax is None
        if noaxis:
            plt.figure(figsize=(4, 1))
            ax = plt.gca()
        ax.plot(sample)
        if noaxis:
            plt.show()
        
    if dtindex:
        sample = pd.Series(sample, index=pd.to_datetime(list(range(len(sample))), unit="m"), name="value").asfreq("1min")
    return sample

def apply(fun, nplots=10, extra=None, dtindex=False, **kwargs):
    """Apply function fun on a random sample of 10 time series and plot them along with their statistics"""
    _, axs = plt.subplots(1, nplots, figsize=(30, 1))
    for i, ax in enumerate(axs):
        ts = random_series(ax=ax, dtindex=dtindex, random_state=i)
        val = fun(ts, **kwargs)
        if extra is not None:
            val = extra(val)
        ax.title.set_text("{:.5G}".format(val) if not isinstance(val, str) else val)
        
def list_format(e):
    """Can be used to format list-like return values in apply"""
    return "[" + ", ".join(["{:.5G}".format(f) for f in e]) + "]"

def list_format_tuple(e):
    """Can be used to format list-like return values in apply"""
    return "[" + ", ".join(["{:.5G}".format(f[1]) for f in e]) + "]"

def list_format_dict_values(e):
    """Can be used to format list-like return values in apply"""
    return "[" + ", ".join(["{:.5G}".format(f) for f in e.values()]) + "]"

DATA, LABELS = read_ucr_file(os.path.join(UCR_ARCHIVE, UCR_DATASET, f"{UCR_DATASET}_TEST.tsv"))

Normalization

Specify which features and which parameters you want to normalize for investigation. This step is only needed if normalized features (via NormalizedTimeSeriesCharacteristics) are used.

  • params: The parameters of the features you want to normalize for investigation.
  • features_to_normalize: The features (including the above parameters) you want to normalize for investigation.
  • funcs_to_merge: The group of parameterized features that should be normalized in bulk, rather than individually.
In [4]:
tsc = TimeSeriesCharacteristics()

params = dict(
    block_sizes=[30, 60],
    quantiles=[0.25, 0.5, 0.75],
    periods=["1h", "2h"],
    periodogram_agg_funcs=["min", ("quantile", dict(q=0.25)), "median", ("quantile", dict(q=0.75)), "max"],
    lags=[1, 2, 3]
)
features_to_normalize = [
    # distributional features
    tsc.kurtosis,
    tsc.skewness,
    tsc.shift,
    *[(tsc.lumpiness, dict(block_size=b)) for b in params["block_sizes"]],
    *[(tsc.quantile, dict(q=q)) for q in params["quantiles"]],
    tsc.ratio_large_standard_deviation,

    # temporal features
    tsc.mean_second_derivative_central,
    *[(tsc.level_shift, dict(block_size=b)) for b in params["block_sizes"]],
    *[(tsc.variance_change, dict(block_size=b)) for b in params["block_sizes"]],
    (tsc.periodicity, dict(dt_min=1.0, periods=params["periods"])),
    (tsc.agg_periodogram, dict(dt_min=1.0, funcs=params["periodogram_agg_funcs"])),
    *[(tsc.time_reversal_asymmetry_statistic, dict(lag=lag)) for lag in params["lags"]],
    tsc.linear_trend_slope,
    (tsc.agg_linear_trend_slope, dict(block_sizes=params["block_sizes"])),
    *[(tsc.c3, dict(lag=lag)) for lag in params["lags"]],

    # complexity features
    *[(tsc.kullback_leibler_score, dict(block_size=b)) for b in params["block_sizes"]],
    tsc.cid_ce
]
funcs_to_merge = [
    "quantile",
    "periodicity",
    "agg_periodogram",
    "time_reversal_asymmetry_statistic",
    "c3"
]

ntsc = NormalizedTimeSeriesCharacteristics()
min_max_df = calculate_min_max_df(DATA, features_to_normalize, quantiles=[0.05, 0.95], index_levels=["entity"], column_value="value", verbose=True)
ntsc.init(features=features_to_normalize, min_max_df=min_max_df, funcs_to_merge=funcs_to_merge, column_min="0.05", column_max="0.95")


1. Distributional Features

Distributional features do not depend upon the temporal structure of the data, i.e., they view the time series as an unordered set of values.
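To see this order-invariance, compare a distributional feature on a series and on a random permutation of it; the value is identical. A minimal sketch (using scipy.stats.kurtosis on synthetic data, purely for illustration, not the notebook's tsc.kurtosis):

    import numpy as np
    from scipy.stats import kurtosis

    rng = np.random.default_rng(0)
    ts = rng.normal(size=500)        # any univariate series
    shuffled = rng.permutation(ts)   # destroy the temporal order

    # distributional features ignore the order, so both values are identical
    print(kurtosis(ts), kurtosis(shuffled))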

Measures of Dispersion

In [5]:
apply(tsc.kurtosis)
apply(ntsc.kurtosis)
In [6]:
apply(tsc.skewness)
apply(ntsc.skewness)
In [7]:
apply(tsc.shift)
apply(ntsc.shift)

Measures of block-wise Dispersion

In [8]:
apply(tsc.lumpiness, block_size=params["block_sizes"][0])
apply(ntsc.lumpiness, block_size=params["block_sizes"][0])
In [9]:
apply(tsc.stability, block_size=params["block_sizes"][0])
apply(ntsc.stability, block_size=params["block_sizes"][0])

Measures on the Number of Duplicates

  • 🟩 Normalized number of maxima duplicates indicates the percentage of duplicate values that have the maximum value of the data
    • Range: $[0,~1]$
  • 🟩 Normalized number of minima duplicates indicates the percentage of duplicate values that have the minimum value of the data
    • Range: $[0,~1]$
  • 🟩 Percentage of reoccurring datapoints to all datapoints: # distinct values occurring more than once / # distinct values
    • Range: $[0,~1]$
  • 🟩 Percentage of reoccurring values to all values: # data points whose value occurs more than once / # data points
    • Range: $[0,~1]$
  • 🟩 Percentage of unique values quantifies the number of unique values over the time series length: # unique values / # values (all three ratios are sketched after this list)
    • Range: $[0,~1]$
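The three ratios above map directly onto a few lines of numpy; a minimal sketch (illustrative only, not necessarily how tsc implements them):

    import numpy as np

    def reoccurrence_ratios(ts):
        ts = np.asarray(ts)
        values, counts = np.unique(ts, return_counts=True)
        return {
            # distinct values occurring more than once / distinct values
            "pct_reoccurring_datapoints": np.sum(counts > 1) / len(values),
            # data points whose value occurs more than once / all data points
            "pct_reoccurring_values": np.sum(counts[counts > 1]) / len(ts),
            # unique values / all data points
            "pct_unique_values": len(values) / len(ts),
        }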
In [10]:
apply(tsc.normalized_duplicates_max)
apply(ntsc.normalized_duplicates_max)
In [11]:
apply(tsc.normalized_duplicates_min)
apply(ntsc.normalized_duplicates_min)
In [12]:
apply(tsc.percentage_of_reoccurring_datapoints)
apply(ntsc.percentage_of_reoccurring_datapoints)
In [13]:
apply(tsc.percentage_of_reoccurring_values)
apply(ntsc.percentage_of_reoccurring_values)
In [14]:
apply(tsc.percentage_of_unique_values)
apply(ntsc.percentage_of_unique_values)

Measures on the Distribution

In [15]:
for q in params["quantiles"]:
    print(f"quantile q = {q:.2f}")
    apply(tsc.quantile, q=q)
    apply(ntsc.quantile, q=q)
    plt.show()
quantile q = 0.25
quantile q = 0.50
quantile q = 0.75
In [16]:
for r in np.arange(1, 4):
    print(f"ratio beyond r = {r:.2f} x sigma")
    apply(tsc.ratio_beyond_r_sigma, r=r)
    apply(ntsc.ratio_beyond_r_sigma, r=r)
    plt.show()
ratio beyond r = 1.00 x sigma
ratio beyond r = 2.00 x sigma
ratio beyond r = 3.00 x sigma
In [17]:
apply(tsc.ratio_large_standard_deviation)
apply(ntsc.ratio_large_standard_deviation)

2. Temporal Features

Temporal features take into account the temporal dependency of data points, i.e., they consider the frequency spectrum, seasonalities, correlations with the time axis, ...
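In contrast to the distributional features above, these values change when the order of the data points is destroyed. A minimal sketch using plain numpy (not tsc.autocorrelation; the random walk is a hypothetical example series):

    import numpy as np

    def lag1_autocorr(ts):
        # Pearson correlation between the series and itself shifted by one step
        return np.corrcoef(ts[:-1], ts[1:])[0, 1]

    rng = np.random.default_rng(0)
    ts = np.cumsum(rng.normal(size=500))       # random walk: strongly autocorrelated
    print(lag1_autocorr(ts))                   # close to 1
    print(lag1_autocorr(rng.permutation(ts)))  # close to 0 once the order is shuffled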

Measures of Temporal Dispersion

In [18]:
apply(tsc.mean_abs_change)
apply(ntsc.mean_abs_change)
In [19]:
apply(tsc.mean_second_derivative_central)
apply(ntsc.mean_second_derivative_central)

Measures of block-wise Temporal Dispersion

In [20]:
apply(tsc.level_shift, block_size=params["block_sizes"][0])
apply(ntsc.level_shift, block_size=params["block_sizes"][0])
In [21]:
apply(tsc.variance_change, block_size=params["block_sizes"][0])
apply(ntsc.variance_change, block_size=params["block_sizes"][0])

Measures of Temporal Similarity

  • 🟥 Hurst is a measure of long-term memory of a time series, related to auto-correlation (related: Detrended Fluctuation Analysis DFA) - https://en.wikipedia.org/wiki/Hurst_exponent
    • Range: $[0, 1]$ (if properly clamped)
  • 🟩 Auto-Correlation (ACF) is the correlation of a signal with a lagged version of itself (related: Correlogram) - https://en.wikipedia.org/wiki/Autocorrelation
    • Range: $[0, 1]$ because we normalize the original $[-1, 1]$ range
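Assuming a simple linear rescaling (the exact normalization scheme may differ), the mapping from the raw range $[-1,~1]$ to $[0,~1]$ is $\mathrm{ACF}_{\mathrm{norm}}(k) = (\rho_k + 1)\,/\,2$, where $\rho_k$ denotes the raw lag-$k$ autocorrelation.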
In [22]:
apply(tsc.hurst)
apply(ntsc.hurst)
In [23]:
for lag in params["lags"]:
    print(f"lag = {lag:d}")
    apply(tsc.autocorrelation, lag=lag)
    apply(ntsc.autocorrelation, lag=lag)
    plt.show()
lag = 1
lag = 2
lag = 3

Measures in the Frequency Spectrum

In [24]:
for period in params["periods"]:
    print(f"periodicity {period}")
    apply(tsc.periodicity, dtindex=True, extra=lambda e: e[0], periods=[period])
    apply(ntsc.periodicity, dtindex=True, extra=lambda e: e[0][1], periods=[period])
    plt.show()
periodicity 1h
periodicity 2h
In [25]:
apply(tsc.agg_periodogram, funcs=params["periodogram_agg_funcs"], nplots=4, dtindex=True, extra=list_format_tuple)
apply(ntsc.agg_periodogram, funcs=params["periodogram_agg_funcs"], nplots=4, dtindex=True, extra=list_format_tuple)
In [26]:
print("slope")
apply(tsc.linear_trend_slope)
apply(ntsc.linear_trend_slope)
plt.show()

print("r^2")
apply(tsc.linear_trend_rvalue2)
apply(ntsc.linear_trend_rvalue2)
plt.show()
slope
r^2
In [27]:
print("variance of slopes with chunk size = {}".format(params["block_sizes"]))
apply(tsc.agg_linear_trend_slope, block_sizes=params["block_sizes"], extra=list_format_tuple)
apply(ntsc.agg_linear_trend_slope, block_sizes=params["block_sizes"], extra=list_format_tuple)
plt.show()

print("mean of r^2 values with chunk size = {}".format(params["block_sizes"]))
apply(tsc.agg_linear_trend_rvalue2, block_sizes=params["block_sizes"], extra=list_format_tuple)
apply(ntsc.agg_linear_trend_rvalue2, block_sizes=params["block_sizes"], extra=list_format_tuple)
plt.show()
variance of slopes with chunk size = [30, 60]
mean of r^2 values with chunk size = [30, 60]
In [28]:
for lag in params["lags"]:
    print(f"lag = {lag:d}")
    apply(tsc.c3, lag=lag)
    apply(ntsc.c3, lag=lag)
    plt.show()
lag = 1
lag = 2
lag = 3
In [29]:
for lag in params["lags"]:
    print(f"lag = {lag:d}")
    apply(tsc.time_reversal_asymmetry_statistic, lag=lag)
    apply(ntsc.time_reversal_asymmetry_statistic, lag=lag)
    plt.show()
lag = 1
lag = 2
lag = 3

3. Complexity Features

Complexity Features measure the "randomness" of a time series, its entropy, etc.
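For intuition, a minimal binned-entropy sketch: histogram the values and take the Shannon entropy of the bin probabilities (illustrative only, not necessarily identical to tsc.binned_entropy):

    import numpy as np

    def binned_entropy(ts, max_bins=10):
        counts, _ = np.histogram(ts, bins=max_bins)
        probs = counts / len(ts)
        probs = probs[probs > 0]               # empty bins contribute 0 * log(0) := 0
        return -np.sum(probs * np.log(probs))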

Measures of Entropy

In [30]:
apply(tsc.binned_entropy, max_bins=10)
apply(ntsc.binned_entropy, max_bins=10)
In [31]:
apply(tsc.kullback_leibler_score, block_size=params["block_sizes"][0])
apply(ntsc.kullback_leibler_score, block_size=params["block_sizes"][0])
In [32]:
apply(tsc.index_of_kullback_leibler_score, block_size=params["block_sizes"][0])
apply(ntsc.index_of_kullback_leibler_score, block_size=params["block_sizes"][0])

Measures of (miscellaneous) Complexity and Permutation

In [33]:
apply(tsc.cid_ce)
apply(ntsc.cid_ce)
In [34]:
# not implemented due to NDA
# apply(tsc.permutation_analysis)
# apply(ntsc.permutation_analysis)
In [35]:
apply(tsc.swinging_door_compression_rate, eps=0.1)
apply(ntsc.swinging_door_compression_rate, eps=0.1)

Measures of Flatness

  • 🟩 Normalized Number of Crossing Points is the number (or, percentage) of times a time series crosses the mean line (related: Fickleness by Matijas et al. https://www.sciencedirect.com/science/article/pii/S095741741300078X)
    • Range: $[0,~1]$
  • 🟩 Normalized Count above Mean is the percentage of values that is higher than the mean
    • Range: $[0, 1]$
  • 🟩 Normalized Count below Mean is the percentage of values that is lower than the mean
    • Range: $[0, 1]$
  • 🟩 Normalized Longest Strike above Mean is the relative length of the longest series of consecutive data points above the mean
    • Range: $[0, 1]$
  • 🟩 Normalized Longest Strike below Mean is the relative length of the longest series of consecutive data points below the mean
    • Range: $[0, 1]$
  • 🟥 Flat Spots measures the maximum run-length of values, if they are divided into bins (either linear, or quantile-based) - https://ieeexplore.ieee.org/abstract/document/7395871
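A hedged sketch of the flat-spots idea with linear (equal-width) bins: assign each value to a bin and report, per bin, the longest run of consecutive samples that stay in that bin (tsc.flat_spots may differ in details, e.g. by also supporting quantile-based bins):

    import numpy as np

    def flat_spots(ts, n_bins=10):
        ts = np.asarray(ts)
        edges = np.linspace(ts.min(), ts.max(), n_bins + 1)
        bins = np.digitize(ts, edges[1:-1])    # bin index in [0, n_bins - 1] per sample
        longest = {}
        for b in range(n_bins):
            run = best = 0
            for hit in (bins == b):
                run = run + 1 if hit else 0
                best = max(best, run)
            longest[b] = best
        return longest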
In [36]:
apply(tsc.normalized_crossing_points)
apply(ntsc.normalized_crossing_points)
In [37]:
apply(tsc.normalized_above_mean)
apply(ntsc.normalized_above_mean)
In [38]:
apply(tsc.normalized_below_mean)
apply(ntsc.normalized_below_mean)
In [39]:
apply(tsc.normalized_longest_strike_above_mean)
apply(ntsc.normalized_longest_strike_above_mean)
In [40]:
apply(tsc.normalized_longest_strike_below_mean)
apply(ntsc.normalized_longest_strike_below_mean)
In [41]:
apply(tsc.flat_spots, nplots=2, extra=list_format_dict_values)
apply(ntsc.flat_spots, nplots=2, extra=list_format_dict_values)

Measures of Peaks/Peakiness

In [42]:
apply(tsc.normalized_number_peaks, n=5)
apply(ntsc.normalized_number_peaks, n=5)
In [43]:
apply(tsc.step_changes, window_len=60)
apply(ntsc.step_changes, window_len=60)

Stationarity and Unit Roots

In [44]:
apply(tsc.adf)
apply(ntsc.adf)
In [45]:
apply(tsc.kpss)
apply(ntsc.kpss)

Dropped Features

tsfresh: abs_energy()
  • 🟨 Absolute Energy is the area under the squared magnitude of the series - https://en.wikipedia.org/wiki/Energy_(signal_processing)
    • Group: Measures of Energy
    • Range: $[0,~\infty)$ (if z-normalized, this is always $n$ ⚠️ do not use on z-normalized data)
  • we have z-normalized data so we cannot use this (see the quick check below)
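A quick check of the claim that the absolute energy of a z-normalized series is always $n$ (synthetic data; population std, as used by preprocessing.scale):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=3.0, scale=2.0, size=300)
    z = (x - x.mean()) / x.std()   # z-normalization with the population std (ddof=0)
    print(np.sum(z ** 2))          # 300.0 == len(z), regardless of the input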
tsfresh: standard_deviation()
  • 🟨 Standard Deviation is a measure of variance (related: Variance) - https://en.wikipedia.org/wiki/Standard_deviation
    • Group: Measures of Dispersion
    • Range: $[0,~\infty)$ (if z-normalized, this is always $1$ ⚠️ do not use on z-normalized data)
  • we have z-normalized data so we cannot use this
tsfresh: partial_autocorrelation(...)
  • 🟥 Partial Auto-Correlation (PACF) is the same as ACF but with a "removal" of the correlation of smaller lags - https://en.wikipedia.org/wiki/Partial_autocorrelation_function
    • Group: Measures of Temporal Self-Similarity
    • Range: $[0, 1]$ because we normalize the original $[-1, 1]$ range
  • Very expensive + the implementation yields unreliable outputs often outside the expected data range of $[-1, 1]$
tsfresh: ar_coefficient(...)
tsfresh: change_quantiles()
  • Pre-processing, no feature.
tsfresh: first_location_of_maximum() first_location_of_minimum() last_location_of_maximum() last_location_of_minimum()
  • The exact location of maxima and minima is not relevant in our domain.
tsfresh: friedrich_coefficients() max_langevin_fixed_point()
tsfresh: index_mass_quantile() quantile()
  • Pre-processing, no feature.
tsfresh: range_count() value_count()
  • Helper function, no feature.
tsfresh: large_standard_deviation()
  • Boolean feature, use a ratio instead.
tsfresh: length()
  • All our time series have the same length.
tsfresh: number_crossing_m()
  • Helper function, no feature.
tsfresh: sum_of_reoccurring_data_points() sum_of_reoccurring_values() sum_values()
  • Our flat spot measures and duplicate measures already cover these in a more general way.
tsfresh: num_duplicates() num_duplicates_max() num_duplicates_min()
  • We use the normalized variants of these.
tsfresh: has_duplicate() has_duplicate_max() has_duplicate_min()
  • Boolean features, use a ratio instead.
tsfresh: variance_larger_than_standard_deviation()
  • Boolean features, use a ratio instead.
tsfresh: mean() median() maximum() minimum()
  • Not useful on scaled data
tsfresh: mean_change()
  • Computes (a[-1] - a[0]) / (len(a) - 1).
  • Not meaningful, already covered by trend measures.
tsfresh: absolute_sum_of_changes()
  • This is the same as mean_abs_change(), but unnormalized
tsfresh: agg_autocorrelation(x, param)
  • Computationally too expensive.
tsfresh: spkt_welch_density() fft_coefficient()
  • Already covered by periodicity metrics
tsfresh: linear_trend_timewise()
  • Only relevant if our time series had gaps or were not evenly spaced
tsfresh: approximate_entropy(x, m, r) / sample_entropy(x)
  • Computationally too expensive. We use binned_entropy(x, max_bins) as an alternative.
tsfresh: number_cwt_peaks()
  • CWT Peaks smooths the signal with a window and counts the remaining peaks that are higher than the SNR.
  • Computationally too expensive. We use number_peaks(x, n) as an alternative.
Index of Dispersion
  • 🟨 Index of Dispersion is a measure of burstiness (related: Fano Factor, Variance-to-Mean Ratio) - https://en.wikipedia.org/wiki/Index_of_dispersion

    • Group: measures of dispersion
    • Range: $(-\infty,~\infty)$ (if z-normalized, this is always $\infty$ ⚠️ do not use on z-normalized data)
    • used by Talagala et al. http://dx.doi.org/10.1080/10618600.2019.1617160
    • code:

        def index_of_dispersion(data):
            # variance-to-mean ratio (Fano factor); np.std matches tsfresh's standard_deviation here
            std = np.std(data)
            mu = np.mean(data)
            return std ** 2 / mu if mu != 0 else np.nan

        apply(index_of_dispersion)
  • we have z-normalized data so we cannot use this
Moving Average (MA) coefficient (already implemented: ma_coefficients(...))
Exogenous Features
  • We are only interested in the raw feature and do not incorporate domain factors or exogenous factors initially.
Auto-Correlation of higher orders
  • Already covered by Correlogram.
Highest Auto-Correlation
  • Already covered by Correlogram.
Granularity
  • We assume a fixed difference in our domain (1 minute).
Non-Linearity
  • Teräsvirta’s neural network test for nonlinearity; alternatively, nonlinearity can be estimated by generating surrogate time series as realisations of the null hypothesis that the series is linear.
  • If the delay vector representations of the original and the surrogate series differ significantly, the time series is considered nonlinear
  • We don't want to fit a neural network as a feature (too complex).
Local Variation Features
  • IQR of Lag-1-Difference and Correlation between time series and Lag-1-Difference
  • Already covered by Correlogram.
Normalized Power of Specified Frequency
  • Already covered by Periodogram.
Curvature
  • Already covered by distributional features.
Variance
  • Already covered by standard deviation.
Symmetry
Traversity
  • Traversity is the standard deviation of the difference between time-series y and y_per where y_n is the n-th ACF lag of y
  • Seems arbitrary and might already be covered by Correlogram.
Decompositions (Peaks, Troughs, Spikiness, ETS, ARIMA, Seasonalities, Trends)
  • STL is problematic since it requires the main period for the seasonal part, which we do not know.
  • Computationally too expensive. We don't want to fit a model.
Predictability
  • We don't want to fit a model.
Statistical Tests (Elliott-Rothenberg-Stock, Schmidt-Philips, Philips-Perron (PP), Zivot-Andrews, Durbin-Watson)
  • We already calculate features for some of them that already quantify the result.
  • Many of them are very complicated, we only use ADF and KPSS tests.
Dynamic Time Warping (DTW)
  • Computationally too expensive.
SAX Quadruple
  • We don't want parameterized compression algorithms
Fickleness
Turning Points
  • A turning point of a series occurs if y_i is a local maximum or minimum with respect to its two closest neighbours (see the sketch below)
  • This is similar to normalized_number_peaks(n) with n = 2
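For reference, a hedged sketch of counting turning points (plain numpy; plateaus are ignored; we do not use this feature):

    import numpy as np

    def normalized_turning_points(ts):
        ts = np.asarray(ts)
        d = np.diff(ts)
        # a sign change between consecutive differences marks a local maximum or minimum
        return np.sum(d[:-1] * d[1:] < 0) / len(ts)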
Detrended Fluctuation Analysis (DFA)
  • Computationally too expensive.
Lyapunov (Chaos)
  • Computationally too expensive.
Correlation Dimension
  • Computationally too expensive.